A step-by-step guide for constructing an RSQLite database using Airbnb data.
For this project, we demonstrate how to separate, clean and upload host and listing data taken from Airbnb for the city of Bangkok in Thailand.
The following libraries are used for data cleaning and database construction
Begin by loading the downloaded data into R using the read_csv() command from the readr library.
data <- read_csv("../data/listings.csv.gz")
We also need to form a database connection object:
While we are actively working on our database construction, we may find it useful to periodically disconnect and reconstruct our database. Having to close R and delete the airdb.SQLite file can be tedious, to get around this, we use the following remove_live_database() function to disconnect and current working database while R is still open:
remove_live_database <- function(con){
if(file.exists("airdb.SQLite")){
if (exists("con")){
dbDisconnect(con)
}
file.remove("airdb.SQLite")
}
}
The data used in for this project is provided by Airbnb and can be found here. To get started, first download the listings.csv.gz from the previous link. In order to help streamline the process of constructing our database, it is beneficial to split the .csv file into a listing table and a host_info table. View the E/R (Entity/Relationship) diagram in the next section, to see how these two tables relate.
From here, we will work on separating and cleaning the two respective tables prior to insertion in our database.
We can examine our initial data using the glimpse() command from the tibble library.
Rows: 17,040
Columns: 74
$ id <dbl> 27934, 27979, 2~
$ listing_url <chr> "https://www.ai~
$ scrape_id <dbl> 2.021122e+13, 2~
$ last_scraped <date> 2021-12-24, 20~
$ name <chr> "Nice room with~
$ description <chr> "Our cool and c~
$ neighborhood_overview <chr> "It is very cen~
$ picture_url <chr> "https://a0.mus~
$ host_id <dbl> 120437, 120541,~
$ host_url <chr> "https://www.ai~
$ host_name <chr> "Nuttee", "Emy"~
$ host_since <date> 2010-05-08, 20~
$ host_location <chr> "Bangkok", "Ban~
$ host_about <chr> "Hi All, I am n~
$ host_response_time <chr> "N/A", "N/A", "~
$ host_response_rate <chr> "N/A", "N/A", "~
$ host_acceptance_rate <chr> "N/A", "N/A", "~
$ host_is_superhost <lgl> FALSE, FALSE, F~
$ host_thumbnail_url <chr> "https://a0.mus~
$ host_picture_url <chr> "https://a0.mus~
$ host_neighbourhood <chr> "Victory Monume~
$ host_listings_count <dbl> 2, 2, 1, 1, 1, ~
$ host_total_listings_count <dbl> 2, 2, 1, 1, 1, ~
$ host_verifications <chr> "['email', 'pho~
$ host_has_profile_pic <lgl> TRUE, TRUE, TRU~
$ host_identity_verified <lgl> TRUE, FALSE, FA~
$ neighbourhood <chr> "Samsen Nai, Ba~
$ neighbourhood_cleansed <chr> "Ratchathewi", ~
$ neighbourhood_group_cleansed <lgl> NA, NA, NA, NA,~
$ latitude <dbl> 13.75983, 13.66~
$ longitude <dbl> 100.5413, 100.6~
$ property_type <chr> "Entire condomi~
$ room_type <chr> "Entire home/ap~
$ accommodates <dbl> 3, 2, 2, 2, 2, ~
$ bathrooms <lgl> NA, NA, NA, NA,~
$ bathrooms_text <chr> "1.5 baths", "1~
$ bedrooms <dbl> 1, 1, 1, 1, 1, ~
$ beds <dbl> 1, 2, 1, 1, 1, ~
$ amenities <chr> "[\"Wifi\", \"H~
$ price <chr> "$1,845.00", "$~
$ minimum_nights <dbl> 3, 1, 60, 5, 1,~
$ maximum_nights <dbl> 90, 730, 730, 3~
$ minimum_minimum_nights <dbl> 3, 1, 60, 5, 1,~
$ maximum_minimum_nights <dbl> 3, 1, 60, 5, 1,~
$ minimum_maximum_nights <dbl> 90, 730, 730, 3~
$ maximum_maximum_nights <dbl> 90, 730, 730, 3~
$ minimum_nights_avg_ntm <dbl> 3, 1, 60, 5, 1,~
$ maximum_nights_avg_ntm <dbl> 90, 730, 730, 3~
$ calendar_updated <lgl> NA, NA, NA, NA,~
$ has_availability <lgl> TRUE, TRUE, TRU~
$ availability_30 <dbl> 27, 23, 29, 11,~
$ availability_60 <dbl> 57, 53, 59, 11,~
$ availability_90 <dbl> 87, 83, 89, 11,~
$ availability_365 <dbl> 362, 358, 364, ~
$ calendar_last_scraped <date> 2021-12-24, 20~
$ number_of_reviews <dbl> 65, 0, 0, 1, 0,~
$ number_of_reviews_ltm <dbl> 0, 0, 0, 0, 0, ~
$ number_of_reviews_l30d <dbl> 0, 0, 0, 0, 0, ~
$ first_review <date> 2012-04-07, NA~
$ last_review <date> 2020-01-06, NA~
$ review_scores_rating <dbl> 4.85, NA, NA, 4~
$ review_scores_accuracy <dbl> 4.95, NA, NA, 5~
$ review_scores_cleanliness <dbl> 4.81, NA, NA, 3~
$ review_scores_checkin <dbl> 4.97, NA, NA, 2~
$ review_scores_communication <dbl> 4.91, NA, NA, 4~
$ review_scores_location <dbl> 4.66, NA, NA, 4~
$ review_scores_value <dbl> 4.75, NA, NA, 5~
$ license <lgl> NA, NA, NA, NA,~
$ instant_bookable <lgl> FALSE, FALSE, F~
$ calculated_host_listings_count <dbl> 2, 2, 1, 1, 1, ~
$ calculated_host_listings_count_entire_homes <dbl> 2, 1, 0, 0, 0, ~
$ calculated_host_listings_count_private_rooms <dbl> 0, 1, 1, 1, 1, ~
$ calculated_host_listings_count_shared_rooms <dbl> 0, 0, 0, 0, 0, ~
$ reviews_per_month <dbl> 0.55, NA, NA, 0~
Before we separate the data into two separate tables, we perform the following preprocessing steps:
date to the type character to help process NA values.NA representations (blanks, None, N/A, NA) to NA.data.data <- data %>%
# (1) Convert dates to characters for NA values
mutate(last_scraped = as.character(last_scraped),
host_since = as.character(host_since),
calendar_last_scraped = as.character(calendar_last_scraped),
first_review = as.character(first_review),
last_review = as.character(last_review),
) %>%
# (2) Homogenize NA values
#*# Taken from: https://rpubs.com/Argaadya/create_table_sql
mutate_all(function(x) ifelse(x == "" | x == "None" | x == "N/A", NA, x)) %>% #*#
# mutate_all(function(x) ifelse(is.na(x), "NULL", x)) %>%
# (3) Convert character strings back to date type
mutate(last_scraped = as.Date(last_scraped),
host_since = as.Date(host_since),
calendar_last_scraped = as.Date(calendar_last_scraped),
first_review = as.Date(first_review),
last_review = as.Date(last_review))
Since a host can have many listings, it’s beneficial to split our initial data into two tables before inserting into our database. The first table that we create, is the host_info table with the same attributes as shown in the E/R Diagram above.
We are now ready to extract and clean data from our initial data table in order to construct our host_data table. This is done and the following four steps:
data table. Note that we use the : syntax to grab many columns at once. This command syntax is inclusive.distinct() function from the dplyr library.host_since column back to the type character. This is required since RSQLite does not support data of the type date. We can convert back to the correct type when performing queries.str_remove_all() function from the stringr library to convert the host_verifications sublists into simple strings. e.g. “[‘email’, ‘phone’]” becomes “email, phone”. # (1) Extract host data
host_data <- data %>%
select(host_id:host_identity_verified,
calculated_host_listings_count:calculated_host_listings_count_shared_rooms)
# (2) Remove duplicate values
host_data <- host_data %>% distinct()
# (3) Convert dates
# Note that this will need to converted back to type = date for analysis
host_data <- host_data %>% mutate(host_since = as.character(host_since))
# (4) Clean host verification column
host_data <-
host_data %>%
mutate(host_verifications = str_remove_all(host_verifications, "[\\'\\[\\]]"))
We can now view our clean data:
rmarkdown::paged_table(host_data)
In order to facilitate the insertion of data into our RSQLite database, we the following insert_to_sql() function.
This function performs a number of operations prior to insertion:
data argument.NA out values are in the correct form. For all non NA values, replace the double quotes with single quotes and bookend strings containing multiple quotes with double quotes.NA values to NULL or data insertion and trim all additional whitespace.\\’s. Finally, use the paste0() function to convert our string into a query prior to insertion in our database’s corresponding relevant table.insert_to_sql <- function(con, table, data){
# (1)
column_name <- paste(names(data), collapse = ", ")
# (2)
data_new <- data %>%
mutate_if(is.character, function(x) ifelse(is.na(x), NA, x %>%
str_replace_all('"', "'") %>% # Replace " with '
paste0('"', . , '"') # Add " before and after string
)
)
value_data <- apply(data_new, MARGIN = 1,
function(x) x %>%
paste(collapse = ",") %>% # Join all column into single string
str_replace_all("\\bNA\\b", "NULL") %>% # Create NULL from NA
str_trim() # remove unnecessary whitespace
)
# (4)
query_value <- paste(value_data) %>%
paste0("(", ., ")") %>% # Add bracket before and after string
paste(collapse = ", ") %>% # Join all values into single string
str_remove_all("\\\\") %>% # Remove \\ from string
paste0("INSERT INTO ", table, "(", column_name, ") VALUES ", .)
# (5)
dbSendQuery(con, query_value)
}
Now that our host_data table is clean. We can create the equivalent table as a query, initially as a string listing the table’s columns, before creating the empty table in our database using our con object:
#################### Create table for host info
query <- "CREATE TABLE host_info(
host_id INT,
host_url VARCHAR(50),
host_name VARCHAR(100),
host_since VARCHAR(50),
host_location VARCHAR(500),
host_about VARCHAR(10000),
host_response_time VARCHAR(50),
host_response_rate VARCHAR(50),
host_acceptance_rate VARCHAR(50),
host_is_superhost BOOLEAN,
host_thumbnail_url VARCHAR(500),
host_picture_url VARCHAR(500),
host_neighbourhood VARCHAR(50),
host_listings_count INT,
host_total_listings_count INT,
host_verifications VARCHAR(500),
host_has_profile_pic BOOLEAN,
host_identity_verified BOOLEAN,
calculated_host_listings_count INT,
calculated_host_listings_count_entire_homes INT,
calculated_host_listings_count_private_rooms INT,
calculated_host_listings_count_shared_rooms INT,
PRIMARY KEY(host_id)
)"
Now create the empty table in our database.
dbSendQuery(con, query)
<SQLiteResult>
SQL CREATE TABLE host_info(
host_id INT,
host_url VARCHAR(50),
host_name VARCHAR(100),
host_since VARCHAR(50),
host_location VARCHAR(500),
host_about VARCHAR(10000),
host_response_time VARCHAR(50),
host_response_rate VARCHAR(50),
host_acceptance_rate VARCHAR(50),
host_is_superhost BOOLEAN,
host_thumbnail_url VARCHAR(500),
host_picture_url VARCHAR(500),
host_neighbourhood VARCHAR(50),
host_listings_count INT,
host_total_listings_count INT,
host_verifications VARCHAR(500),
host_has_profile_pic BOOLEAN,
host_identity_verified BOOLEAN,
calculated_host_listings_count INT,
calculated_host_listings_count_entire_homes INT,
calculated_host_listings_count_private_rooms INT,
calculated_host_listings_count_shared_rooms INT,
PRIMARY KEY(host_id)
)
ROWS Fetched: 0 [complete]
Changed: 0
Next, check the schema of our database so far.
res <- dbSendQuery(con, "PRAGMA table_info([host_info]);")
fetch(res)
cid name type
1 0 host_id INT
2 1 host_url VARCHAR(50)
3 2 host_name VARCHAR(100)
4 3 host_since VARCHAR(50)
5 4 host_location VARCHAR(500)
6 5 host_about VARCHAR(10000)
7 6 host_response_time VARCHAR(50)
8 7 host_response_rate VARCHAR(50)
9 8 host_acceptance_rate VARCHAR(50)
10 9 host_is_superhost BOOLEAN
11 10 host_thumbnail_url VARCHAR(500)
12 11 host_picture_url VARCHAR(500)
13 12 host_neighbourhood VARCHAR(50)
14 13 host_listings_count INT
15 14 host_total_listings_count INT
16 15 host_verifications VARCHAR(500)
17 16 host_has_profile_pic BOOLEAN
18 17 host_identity_verified BOOLEAN
19 18 calculated_host_listings_count INT
20 19 calculated_host_listings_count_entire_homes INT
21 20 calculated_host_listings_count_private_rooms INT
22 21 calculated_host_listings_count_shared_rooms INT
notnull dflt_value pk
1 0 NA 1
2 0 NA 0
3 0 NA 0
4 0 NA 0
5 0 NA 0
6 0 NA 0
7 0 NA 0
8 0 NA 0
9 0 NA 0
10 0 NA 0
11 0 NA 0
12 0 NA 0
13 0 NA 0
14 0 NA 0
15 0 NA 0
16 0 NA 0
17 0 NA 0
18 0 NA 0
19 0 NA 0
20 0 NA 0
21 0 NA 0
22 0 NA 0
dbClearResult(res)
Finally, insert our host_data table into the equivalent table in our RSQLite database using the insert_to_sql() function as defined above.
insert_to_sql(con, "host_info", host_data)
We can verify the contents of our newly created and populated RSQLite database table host_info as follows:
res <- dbSendQuery(con, "SELECT * FROM host_info LIMIT 10")
out_db <- fetch(res)
dbClearResult(res)
rmarkdown::paged_table(out_db)